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. Abstract: Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. 

! We investigate performance properties of spMVM with matrices of various sparsity patterns on the nVidia 

Q\ \ "Fermi" class of GPGPUs. A new "padded jagged diagonals storage" (pJDS) format is proposed which 

may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our test 

scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and 

achieves 95% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify 

performance bottlenecks on the node level that invalidate some types of matrix structures for efficient multi- 

GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory 

parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of commu- 

£N| ■ nication and computation. 
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The solution of large eigenvalue problems or extremely sparse systems of linear equations is a cen- 
tral part of many numerical algorithms in science and engineering. Sparse matrix-vector multipli- 
cation (spMVM) is often the dominating component in such solvers, and may easily consume most 
of the total runtime. General-purpose computation on graphics processing units (GPGPUs) is an at- 
tractive option for this operation due to the large memory bandwidth available to high-end graphics 
chips and their inherent massive parallelism. Implementations of spMVM on GPGPUs have been a 
field of active research in recent years [1, 2, 3], and several storage formats have been proposed. Out 
of those, the ELLPACK-R format [ ] has gained widespread acceptance. However, although there is 
a long history of distributed-memory parallel spMVM codes (see [ ] and references therein), there 
is to our knowledge no efficiency or feasibility analysis of multi-GPU spMVM. 

This work has two goals: It provides an alternative sparse MVM storage format that has a 
significantly smaller memory footprint than ELLPACK(-R) but provides better performance in most 
cases on modern nVidia GPGPUs. Furthermore, it extends previous work on distributed-memory 
spMVM for general matrices to multiple GPGPUs. 



1.2 Testbed 



The nVidia "Fermi" class of GPGPU-based accelerators (Tesla C/M20X0) used for the benchmarks 
implement the "GF100" architecture and comprise 14 streaming multiprocessors (MPs), each with 



32 in-order cores. One core can execute one single-precision (SP) multiplication and one addition 
per cycle, which leads to an overall peak performance of 896 flops per cycle on the whole chip 
at clock frequencies above 1 GHz. At double precision (DP) the theoretical peak performance is 
halved. The boards are currently available with device memory sizes of 3 (C2050) or 6 GB (C2070), 
and feature deactivatable ECC protection. In streaming benchmarks the device memory delivers 
about 91 GB/s sustained with ECC enabled (120 GB/s w/o ECC) [5]. All cores share a 768 kB L2 
cache, whose detailed specifications are undisclosed. 

The cores in an MP are driven in a single instruction multiple data (SIMD) manner (also termed 
SIMT model, where "T" stands for "threads"). All threads running on an MP are controlled by 
a simple instruction scheduler that can switch quickly between chunks of threads called warps, in 
order to hide latencies. A warp (or a subset of it) is the actual SIMD unit on this device, and it 
is essential that consecutive threads in a warp access consecutive memory locations (this is called 
coalescing). Although still important, coalescing constraints have been somewhat relaxed with the 
GF100 architecture due to its L2 cache, which was not present on earlier models. 

The parallel runs have been conducted on the Dirac 1 GPGPU cluster at the National Energy 
Research Scientific Computing Center (NERSC) in Berkeley. This cluster features 50 GPU nodes, 
of which 44 contain one nVidia Tesla C2050 card with 3 GB of device memory. 

1.3 Test matrices 

HMEp: This matrix originates from the quantum-mechanical description (using the so-called 
Holstein-Hubbard model) of a one-dimensional solid with six lattice sites, populated with six elec- 
trons coupled to 15 phonons (quantized lattice vibrations). The resulting matrix of dimension 
6.2 x 10 6 is very sparse, with approximately 15 non-zeros per row. It also contains contiguous 
off-diagonals of length 15,000. 

sAMG: This matrix was generated by the adaptive multigrid code sAMG (see [6, 7], and references 
therein) for the irregular discretization of a Poisson problem on a car geometry. Its matrix dimension 
is 3.4 x 10 6 with an average of iV nzr ~ 7 entries per row. 

DLR1: This matrix comes from an adjoint problem computation (turbulent transonic flow over a 
wing) with the TAU CFD system of the German Aerospace Center (DLR). TAU performs complex 
flow simulations on unstructured hybrid grids. The associated grid had 46,417 points (6 unknowns 
in each point), and the resulting matrix is nonsymmetric with a dimension of 2.8 x 10 5 and an 
average of N nzt ^144 entries per row. 

DLR2: This matrix stems from a linear problem for an aerodynamic gradients calculation. A 
transonic inviscid flow over a wing was simulated with TAU. The associated grid had 108,396 
points, and the matrix is nonsymmetric with a dimension of 5.4 x 10 5 and an average of 7Y nzr « 315 
entries per row. It consists entirely of dense 5x5 subblocks. 

UHBR: The last matrix originates from aeroelastic stability investigations of an ultra- high bypass 
ratio (UHBR) turbine fan with a linearized Navier-Stokes solver [8]. This solver is part of the paral- 
lel simulation system TRACE (Turbo-Machinery Research Aerodynamic Computational Environ- 
ment) which was developed by DLR's Institute for Propulsion Technology. Its matrix dimension is 
4.5 x 10 6 with an average of N nzi ^ 123 entries per row. 



http://www.nersc.gov/users/computational-systems/dirac/ 




2 GPGPU spMVM 

2.1 Introducing the padded JDS formats 

ELLPACK-R [3] is a variant of the original ELLPACK storage format [1,9] and sets today the 
standard for implementing spMVM operations on GPGPUs. ELLPACK(-R) should be used if no 
regular substructures such as off-diagonals or dense blocks can be exploited. The idea is to compress 
the rows by shifting all non-zero entries to the left (first step in Fig. 1) and storing the resulting 
N x N™ x rectangular matrix 2 column-by-column consecutively in main memory, where N™ x is 
the maximum number of non-zeros per row. Thus, in contrast to CPU storage formats, ELLPACK 
contains zero entries (white boxes in Fig. 1). Thread parallelization of the spMVM is row-wise 
by assigning consecutive rows to the threads of a block (i.e., outer loop iterations in Listing 1 are 
mapped to threads in a round-robin way). 

Listing 1: The standard ELLPACK-R spMVM kernel 

1 f or (i=0 ; i < N ; ++i) 

2 for(j=0; j < rowmax [i] ; ++ j ) 

3 c[i] += val[j*N + i] * rhs [ col_idx[j*N + i] ]; 

The increased memory footprint of the ELLPACK format ensures load coalescing within thread 
warps for access to the matrix entries (val [] ) and the index array (col_idx [] ), which points to the 
right hand side (RHS) vector elements (rhs [] ). While data alignment became of minor importance 
with the latest nVidia GPGPU generations, load coalescing is still a must for attaining reasonable 
data transfer rates. In the original ELLPACK scheme the threads were still loading and operating 
on the zero matrix entries, wasting memory bandwidth and compute resources. 

The ELLPACK-R scheme uses the same storage format, but threads only execute non-zero con- 
tributions (the number of non-zeros per row is stored in rowmax [] ), avoiding redundant data trans- 
fers. However, all threads of a warp occupy on-chip resources until the thread executing the longest 
row has finished. Figures 2a and b compare the overhead of the ELLPACK(-R) schemes assuming 
a warp size of four threads. ELLPACK-R reduces computation and data accesses to the possible 
minimum (arrows in Fig. 2b). However, the imbalanced row lengths impose reservation of unused 
hardware units (light boxes). Moreover the redundant storage (indicated by white and light boxes) 
stays the same. 



2 Typically the matrix dimension N must also be padded to a multiple of the warp size. 




Figure 2: Scheduling patterns and required 
storage size for the different matrix formats 
assuming a four thread warp. Arrows in- 
dicate computation and data accesses ex- 
ecuted by the threads. White boxes show 
redundant data storage, and light boxes in- 
dicate redundant data storage and useless 



(a) ELLPACK (b) ELLPACK-R (c) pJDS hardware reservation. 

A simple idea, derived from the Jagged Diagonals Storage (JDS) format used for vector comput- 
ers, can drive the matrix format towards better utilization of compute resources and storage. First 
the rows of the ELLPACK scheme are sorted according to the number of non-zeros, starting with 
the longest row ("sort" step in Fig. 1). Then, blocks of b r consecutive rows (where b r should be 
the warp size) are padded to the longest row within the block ("pad" step in Fig. 1). We call the 
result "padded Jagged Diagonals Storage" (pJDS). This maintains load coalescing while most of the 
zero entries can be eliminated. Since the columns typically have different lengths, a (small) array 
col_start [] of size (N™ x x 4 byte) is required to store the starting offset of each column. The 
pJDS kernel is shown in listing 2. 

Listing 2: The spMVM kernel of the pJDS format 

1 f or (i=0 ; i < N ; ++i) 

2 for(j=0; j < rowmax[i]; ++j){ 

3 col_offset = col_start [ j ] ; 

4 c[i] += val [ col_of f set + i] * rhs [ col_idx [col_of f set + i] ] ; 

5 } 

It maintains the structure and simplicity of the ELLPACK-R kernel but provides potential for (sub- 
stantial) data reduction and better hardware utilization. The main drawback of the format is that 
the spMVM operation needs to be performed in a permuted basis. However, for most iterative sp- 
MVM algorithms such as Krylov subspace methods, permutation of the indices needs to be done 
only before the start and after the end of the algorithm, while the complete iterative scheme works 
on the permuted elements. On the downside, the permutation of the matrix rows can destroy matrix 
structures such as off-diagonals or local dense blocks, leading to a loss of load coalescing or cache 
reuse on the RHS vector. 

The sparsity pattern determines the data reduction potential of pJDS. If the matrix has a constant 
row length (rowmax [] =A r ™ x ), ELLPACK and pJDS both have no storage overhead (N x N™ x non- 
zeros). On the other hand, if there is one fully populated row and a single entry in all others, the 
plain ELLPACK format would store the full matrix, i.e., N x N elements. In pJDS it is sufficient 
to hold b r x N + N — b r = (b, ■+ 1) x N — b r entries. At a typical value of b r = 32 one can expect a 
substantial reduction of the memory footprint for matrices with a wide variation in row lengths. 
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Figure 3: Row 
length distribu- 
tion histograms 
of the matrices 
described in 
Sect. 1.3. The 
bin size is 1 for 
all cases. 
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Table 1 : Data reduction of pJDS in comparison to the ELLPACK format and performance (in GF/s, 
excluding data transfers) of the different storage formats on an nVidia Fermi C2070 GPGPU. The 
best performance for each matrix and execution environment (double precision with ECC enabled 
on vs. single precision with ECC disabled) is highlighted. The last row shows the performance of a 
dual-socket (12 core) Intel Westmere node. See [ 4] for implementation and hardware details. 



The row length histograms (Fig. 3) for DLR1/2, sAMG, and HMEp show that there is plenty of 
data reduction potential for those matrices. DLR1 should benefit least from the pJDS format since 
it has the lowest relative width (max(rowmax [] )/min(rowmax[] )~ 2) and most of the weight is 
clustered close to the maximum row length, i.e., 80% of the rows have a length of 0.8 x N™ x . 
In contrast, the longest row of sAMG is more than four times larger than the smallest one, and 
short rows account for most of the weight. The data reduction rates finally achieved by using pJDS 
instead of ELLPACK follow this qualitative discussion and are shown in Table 1. Considering the 
limited amount and high cost of device memory on GPGPUs, pJDS delivers a useful shrink of the 
memory footprint for spMVM on GPGPUs; e.g., the DLR2 matrix fits (in double precision) on an 
nVidia Fermi C2050 GPGPU only when using the pJDS format. The overhead of pJDS compared 
to a minimum implementation (storing the non-zeros only) is less than 0.01% for the matrices 
considered here (choosing t b = 32). 



The improved hardware utilization by pJDS (compare Figs. 2b and c) is also reflected in the 
performance numbers. In most scenarios gains of up to 30% can be achieved with pJDS, while the 
largest penalty is limited to 5%. Since the row permutation may destroy regular substructures, the 
DLR2 and HMEp matrices do show some performance drop or only moderate speed-ups due to 
reduced cache reuse and load coalescing for the RHS vector. This problem is more severe on older 
GPGPU generations without L2 cache, such as the Tesla CI 060. Here it is also necessary to map 
the array holding the column starting offsets (col_start [] ) to the texture cache. 

In summary, the pJDS format presents a very attractive alternative to the ELLPACK-R scheme 
on modern GPGPUs both in terms of performance and memory footprint. 



2.2 GPGPU performance model and PCIe transfer impact 

Due to the small (or non-existent) data cache on GPGPUs, the expected speedup compared to a 
multicore socket is usually smaller than the ratio of memory bandwidths. The worst-case code 
balance of the ELLPACK and pJDS kernels for double precision is 
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The parameter 1 /N™ x < a < 1 quantifies the possible re-use of RHS data from cache: If there is 
no cache, i.e., if each load to the RHS vector goes to memory, we have a = 1. Hence, cache is able 
to reduce the balance by some amount. In the ideal case a = 1 /N^ x each RHS element has to be 
loaded only once. This corresponds to the K = case in [4]. Note that B^J" may change from block 
to block due to different values of N™ x , and that the col_start [] array is assumed to always come 
from cache. In the following we assume an average value iV nzr for the number of non-zeros per row. 

The bandwidth model (1) is only valid for the kernel execution on the device, and does not 
include the data transfers required to bring the RHS vector to the GPU and the result back to the host. 
However, one can extend the model to incorporate those overheads. Since two distinct bandwidths 
are involved we now look at the expected wallclock times for the pure spMVM (7mvm) an d the 
required data transfer of the RHS and LHS vectors over the PCIe bus (7pci): 
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This shows that a low PCIe bandwidth has less impact on the overall execution time if iV nzr is large, 
hence we can estimate the range of favorable N nzi values: Setting ?mvm < 7pci> assuming more 
than 50% penalty from the PCIe transfers, we arrive at 

2(Wfr3-l) (3) 
a + 3/2 

In the worst case, a = 1 /N nzr and #gpu > 205pci lead to N nzi < 25. On the other hand, if a = 1 
and 5gpu ~ lOBpci we have iVnzr < 7. Thus we do not expect a significant benefit from GPGPU 
acceleration for the HMeP and sAMG matrices described above, since those have iV nzr ~ 15 and 7, 
respectively. 

If we want less than 10% penalty from PCIe transfers (Tmym > lOTpci) we get 
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Figure 4: Timeline for GPGPU-based spMVM kernel including host data transfers with a dedicated 
host thread for asynchronous MPI communication (thread 0). The "local gather" is the collection of 
data to be sent to other processes into a contiguous buffer. 

so at 5gpu ~ 105pci and a = 1 a value of iV nzr > 80 is sufficient. This is certainly satisfied for all 
DLR matrices. In the worst case, i.e., at #gpu ~ 205pci and a = 1 /N nzr one arrives at N nzr > 266; in 
this case we can expect a measurable impact of PCIe transfer overhead for the matrices considered 
here. 

3 Distributed-memory spMVM parallelization 

As described in Sect. 2.2, matrices with small N nzi are no good candidates for GPGPU acceleration 
since the required PCIe transfers for the RHS and LHS vectors dominate the runtime. Although this 
penalty is somewhat mitigated by the fact that in some real applications parts of those vectors may 
be kept on the device, all data that has to be communicated using MPI must also cross the PCIe bus 
to the GPGPU. For the HMEp (sAMG) matrix we arrive at a single-GPU performance level of 3.7 
(2.3) GF/s, which is already below the capability of a typical dual-socket server node (see Table 1). 
Hence, we restrict the discussion in this section to the DLR1 and UHBR matrices. Although they 
also suffer from PCIe transfers to some extent (10.9 GF/s vs. 12.9 GF/s for DLR1), there is still a 
substantial advantage over the CPU version. 

All runs were performed in double precision and with active ECC on the NERSC Dirac clus- 
ter. The ELLPACK-R format was used throughout, since the matrix storage format is of minor 
importance for the double precision case (see Table 1) and for the concepts we want to demonstrate 
here. An implementation of the multi-GPGPU code with the pJDS format and an analysis of its 
performance implications is ongoing work and will follow the strategy described in [10]. 

3.1 Multi-GPGPU spMVM 

The basic design patterns and choices described in [4] for distributed-memory parallel sparse MVM 
also apply for the multi-GPGPU case. We distinguish between vector mode, which resembles the 
programming style on vector-parallel machines, and task mode, which dedicates host resources 
(threads) to different tasks, i.e., communication and computation. In this work we consider three 
alternatives: 

Vector mode without overlap of communication and computation. The required communication 
to distribute the nonlocal RHS elements among the processes is separate from the actual spMVM 
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Figure 5: Strong scaling results for DLR1 (left) and UHBR (right) on the Dirac cluster. Due to 
memory restrictions on the C2050 cards it was not possible to run the UHBR case on fewer than 
five nodes. 



communication, which is performed in a single step. 

Vector mode using naive overlap of communication and computation by nonblocking MPI. The 
spMVM must be split into a local and a nonlocal part, and the former may be overlapped with MPI. 
Since the result vector must be written twice, there is a slight increase in memory traffic, which 
adds another 8 /N nzr bytes/flop to the code balance (1). Due to the rather large N nzi of the DLR1 
and UHBR matrices we expect a performance penalty of below 10%, though. Since most MPI 
libraries do not support asynchronous nonblocking point-to-point communication, we do not expect 
this variant to have any advantage even over vector mode without overlap. 

Task mode using a dedicated thread for MPI in order to implement reliably asynchronous commu- 
nication. Figure 4 shows an event timeline that visualizes the different tasks executed on two host 
threads (or more if there are multiple GPGPUs in a node) and the GPGPU. Depending on the ratio 
of communication to computation time, the possible performance benefit can be at most a factor of 
two. At strong scaling we expect task mode and vector mode performance to converge. 

3.2 Performance results 

Figures 5a and 5b show strong scaling results for the DLR1 and UHBR matrices, respectively, on 
the Dirac cluster. Task mode achieves better performance than any of the vector modes in both 
cases; however, the details differ considerably: 

DLR1 has a rather small dimension of 2.8 x 10 5 , so that only 8750 rows (about 1.3 x 10 6 non- 
zeros) are left per GPGPU at 32 nodes. The smallness of the per-GPGPU subproblem leads to a 
substantial performance drop, which mainly originates from the nonlocal part in the naive vector and 
task mode versions. It can, however, be partially compensated by asynchronous communication. At 
larger node counts the performance of all variants starts to converge, as expected. 

UHBR has a much larger number of non-zeros at a similar N nzi as DLR1 , and thus does not show 
an analogous per-GPGPU performance breakdown when scaling up the node count. Scalability is 
very good in task mode with a parallel efficiency of 84% at 32 nodes (about 70% with naive overlap 
vector mode). Since the communication requirements are weaker than for DLR1, we do not see a 
similarly large benefit from overlapping communication at the node counts accessible on the cluster 



used. 



4 Conclusions and outlook 

We have introduced a new "padded JDS" sparse matrix format, which is suitable for sparse matrix- 
vector multiplication on GPGPUs at similar or better performance levels than the popular ELLPACK- 
R format, with a potential for significant memory savings. 

Via suitable performance models we have derived a condition for the average number of non- 
zeros per matrix row that guarantees a useful performance benefit of GPGPU-based spMVM in 
comparison to standard server nodes, the main parameter being the ratio between PCI express band- 
width and GPGPU memory bandwidth. 

Finally we have extended previous work on efficient distributed-memory hybrid (MPI+OpenMP) 
spMVM parallelization to the multi-GPGPU case. Using dedicated host threads for explicitly 
asynchronous MPI communication we were able to improve significantly over naive "vector-like" 
approaches and show the potentials and limitations of this solution. 

Future work will cover more extensive scaling studies on larger GPGPU clusters, an implemen- 
tation of the pJDS format in the multi-GPGPU code, a thorough investigation of the performance 
degradation with strong scaling, and the application of our results to a production- grade eigensolver. 

During the preparation of the manuscript it came to our attention that other research groups 
have devised sparse matrix formats that share some features with pJDS, most notably the "sliced 
ELLPACK" and "sliced ELLR-T" formats [11, 12]. A thorough comparison of pJDS with those 
alternative approaches is work in progress. 
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